INTRODUCTION


In spite of the tremendous progress in computer technology, speech recognition
remains one of the most difficult tasks for computers to accomplish. Several factors
contribute to this complexity, including speaker variability, large data samples, and
excessive computational overhead in detection. Most speech recognition systems
consist primarily of two major phases: time-invariant speech feature extraction
and detection.

Variability from speaker to speaker is due to variations in loudness, rate, and
dialect. The loudness problem can usually be taken care of with amplitude
normalization. Solutions to the problem of rate or time have been attempted using time
normalization (time-warping) (refs 1 and 2). The variability of speech due to dialect is
very difficult to manage. Various measures have been attempted to alleviate this problem,
such as using averaged or multiple templates as a reference instead of a single template.
Averaging the templates extracts the primary features of the speech signal. A
detector that uses averaged templates is therefore less sensitive to minor local
variations due to dialect. The recognition system then responds only to the principal
features of the whole word, becomes immune to local micro changes, and is therefore
more robust. Most of the averaging procedures used in earlier studies are
time-domain techniques. Averaging in time is quite difficult because the end points
are hard to identify exactly. Using multiple templates eliminates the need for
averaging but slows down the conventional speech recognizer, because the system must
examine more patterns before determining which word was spoken.

The second problem is the large amount of data needed for processing. The
spectral range of speech lies approximately between 60 Hz and 4 kHz. With a sampling
frequency of 8 kHz and word lengths as long as 1 sec, the data sample will contain
8,000 points. Substantial data reductions have been achieved through Linear
Predictive Coding (LPC) (ref 3) and Short-term Fourier Analysis (ref 4). Further data
reduction has been achieved through vectorization by replacing vectors with simple
indexes (refs 4 and 5).

Finally, detection, or finding the distance between the reference template and the
input, has been one of the major problems in speech recognition because of its excessive
computational overhead, especially for systems with a large vocabulary. The following
section describes a typical conventional speech recognition system.

CONVENTIONAL SPEECH RECOGNITION SYSTEMS


A block diagram of a typical conventional speech recognition system is shown in
figure 1. The system involves three basic steps:


1. Speech coding or feature extraction: Computes spectral coefficients every
10 or so milliseconds using either a Fast Fourier Transform (FFT) or Linear Predictive
Coding (LPC).

2. Time-warping: Computes local frame-to-frame distances and uses these
local distances to time align input sequences.

3. Detection: Computes the whole word matching score using the local
distances to the corresponding reference word templates.


Figure 1. Three basic steps involved in conventional speech recognition
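
The time-warping and detection steps listed above can be sketched with a textbook dynamic time-warping (DTW) recurrence. This is only an illustration of the conventional approach, not the implementation of any particular system: the Euclidean frame distance, the three-way step pattern, and the names `dtw_distance` and `recognize` are assumptions made for the example.

```python
import numpy as np

def dtw_distance(ref, inp):
    """Whole-word matching score between two sequences of spectral frames.

    ref, inp: array-likes of shape (num_frames, num_coeffs).
    """
    ref = np.asarray(ref, dtype=float)
    inp = np.asarray(inp, dtype=float)
    n, m = len(ref), len(inp)
    # Local frame-to-frame distances (Euclidean distance is an assumption).
    local = np.linalg.norm(ref[:, None, :] - inp[None, :, :], axis=2)
    # Accumulate the local distances along the cheapest monotonic alignment.
    acc = np.full((n + 1, m + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            best_prev = min(acc[i - 1, j], acc[i, j - 1], acc[i - 1, j - 1])
            acc[i, j] = local[i - 1, j - 1] + best_prev
    return acc[n, m]

def recognize(templates, inp):
    """Detection: return the word whose reference template scores lowest."""
    return min(templates, key=lambda word: dtw_distance(templates[word], inp))
```

Because `recognize` runs one full alignment per reference template, its cost grows with the vocabulary size, which is the computational burden discussed in this section.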

The problem with this system is the assumption that acoustic clues in a given
word/signal appear in a precise time sequence; this is an erroneous assumption (ref 6).
For a robust speech recognition system, it is essential to look for global word clues
rather than local peaks and valleys. This is possible only if the analysis is based on
entire word data rather than local data sets. As indicated earlier, the detector is
computationally intensive. This computation is especially time consuming for large
vocabulary systems, since the input signal has to be compared with every reference
template. The neural network based prototype system described here addresses both the
problem of global feature extraction and the need for an improved detector to alleviate
the problem of extensive computation.

NEURAL NETWORK BASED SPEECH RECOGNITION SYSTEM


A block diagram of the neural network based speech recognition system developed
in this work is shown in figure 2.




Figure  2.  A  neural  network  based  speech  recognition  system 

The  system  primarily  consists  of  two  sections:  a  time-invariant  speech  coder  and  a 
neural  network  based  detector. 


Time-Invariant  Speech  Coding 

This section performs spectral feature extraction for a whole word. It is implemented
as a 2k-point FFT program, giving a spectral range of approximately 2 Hz to 4 kHz. Each
input was sampled at 8 kHz for a total sampling time of 0.50 sec; most of the words
are shorter than 0.50 sec. Whole-word processing, unlike conventional systems in which
a group of local data sets is used, eliminates the need for time-warping
and end-point detection. The first half of the 2k-point FFT was averaged over each 8
consecutive points to give a compressed data set of 256 points. This compression was
needed to resolve memory problems during computer simulation of the neural network
detector.
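
A minimal sketch of this coding step follows, under assumptions that should be read carefully. Averaging groups of 8 points down to 256 values implies 2,048 retained spectral points, so the sketch zero-pads the 4,000 samples (0.50 sec at 8 kHz) to a 4,096-point transform and keeps its first 2,048 bins, which also gives the roughly 2 Hz resolution quoted above. The transform length, the use of the magnitude spectrum, and the name `encode_word` are assumptions rather than details confirmed by the text.

```python
import numpy as np

SAMPLE_RATE_HZ = 8000
NUM_SAMPLES = 4000   # 0.50 sec at 8 kHz
FFT_SIZE = 4096      # assumed transform length (bin width of about 1.95 Hz)
BINS_KEPT = 2048     # retained half of the spectrum, up to about 4 kHz
GROUP = 8            # consecutive points averaged together

def encode_word(samples):
    """Return a 256-point whole-word spectral feature vector."""
    x = np.zeros(FFT_SIZE)
    n = min(len(samples), NUM_SAMPLES)
    # Short words are zero-padded, so no end-point detection is needed.
    x[:n] = np.asarray(samples, dtype=float)[:n]
    # Magnitude spectrum (dropping the phase is an assumption of this sketch).
    magnitude = np.abs(np.fft.rfft(x))[:BINS_KEPT]
    # Average every 8 consecutive bins: 2048 / 8 = 256 features.
    return magnitude.reshape(-1, GROUP).mean(axis=1)
```

Each of the 256 features then covers a band of about 15.6 Hz, and the vector length matches the number of PEs in the input layer of the detector.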

Neural  Network  Detector 

The detector was designed using a back-propagation (BP) neural network (refs 7
and 8). The network was trained for a set of 11 words used for controlling the
UNIMATE PUMA-560 robot. The word set used was: Stop, Start, Exit, Forward,
Backward, Right, Left, Up, Down, Open, and Close.

In normal test mode, the detector generates a coded equivalent of the word
recognized. A program in the robot controller converts this coded sequence into an
appropriate binary code for the robot to act. This coded sequence is then converted to
an ASCII text string. It is also fed to a speech synthesizer (Speech Plus Inc.; Model:
CallText 5050) for audio reproduction of the received commands.

The detector consists of a multilayer feed-forward network with one input layer, two
hidden layers, and one output layer. The network has 256 processing elements (PEs)
in the input layer, 20 PEs in each of the hidden layers, and 11 PEs in
the output layer. The number of PEs in the input layer equals the number of
outputs used from the FFT processor; the number of PEs in the hidden layers was chosen
as a compromise between speed of training and representational power. The output layer
consists of 11 PEs, corresponding to the number of bits required to represent the
maximum number of words in the test vocabulary. Details of the NN-based detector are
included in the following section.
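
As an illustration of how 11 output PEs can represent the 11-word vocabulary, the sketch below assumes one output PE per word (a one-of-N code) and decodes by taking the most active PE. The text does not spell out the coding scheme or the decoding rule, so both, along with the names `one_hot` and `decode`, are assumptions.

```python
WORDS = ["Stop", "Start", "Exit", "Forward", "Backward",
         "Right", "Left", "Up", "Down", "Open", "Close"]

def one_hot(word):
    """Desired output vector for a training word (assumed one PE per word)."""
    return [1.0 if w == word else 0.0 for w in WORDS]

def decode(outputs):
    """Map the 11 output-layer activations to the recognized command word."""
    if len(outputs) != len(WORDS):
        raise ValueError("expected one activation per vocabulary word")
    best = max(range(len(WORDS)), key=lambda i: outputs[i])
    return WORDS[best]
```

The string returned by `decode` plays the role of the ASCII text passed on to the robot controller and the speech synthesizer.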

Back-Propagation  Network  Topology 

A topological description of the NN-speech detector, showing the number of
layers used, the number of PEs in each layer, the type of transfer functions used, and
the learning rules for each layer, is shown in figure 3. The description of the transfer
functions and algorithms follows the topological description, with the following used
as the defaults:

    Summation Fn (SF)  =  None        Learning Rule (LR)  =  Cumulative Delta
    Scale              =  1.0         High Limit (HL)     =  +1
    Offset             =  0.0         Low Limit (LL)      =  -1
    Output Fn (OF)     =  Direct      Transfer Fn (TF)    =  Sigmoid Fn
    Bias (B)           =  1

    Output layer     11 PEs    TF = Linear     LR = Delta Rule
        Fully connected; weights randomized in (-0.1, +0.1)
    Hidden layer 2   20 PEs    TF = Sigmoid    LR = Delta Rule
        Fully connected; weights randomized in (-0.1, +0.1)
    Hidden layer 1   20 PEs    TF = Sigmoid    LR = Delta Rule
        Fully connected; weights randomized in (-0.1, +0.1)
    Input layer      256 PEs   TF = Linear

Figure 3. Back-propagation neural network topology

Transfer  Functions 


o Sigmoid

    $y = \left(1 + e^{-I \cdot \mathrm{Gain}}\right)^{-1}$

o Linear

    $y = I$

where I is the weighted sum of the inputs, given by $I_j = \sum_i X_i W_{ji}$.
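
The two transfer functions and the weighted sum can be combined into a forward pass through the 256-20-20-11 network. This is a sketch under assumptions, not the simulator used in the study: it takes the hidden layers as sigmoid and the input and output layers as linear (one reading of figure 3), leaves out the bias term, and uses the hypothetical names `init_weights` and `forward`.

```python
import numpy as np

LAYER_SIZES = [256, 20, 20, 11]  # input, hidden 1, hidden 2, output

def sigmoid(I, gain=1.0):
    """y = (1 + exp(-I * gain)) ** -1"""
    return 1.0 / (1.0 + np.exp(-I * gain))

def init_weights(rng, sizes=LAYER_SIZES, limit=0.1):
    """Fully connected layers; W[j, i] links PE i below to PE j above."""
    return [rng.uniform(-limit, limit, size=(n_out, n_in))
            for n_in, n_out in zip(sizes[:-1], sizes[1:])]

def forward(weights, x):
    """Propagate one 256-point feature vector to the 11 output PEs."""
    y = np.asarray(x, dtype=float)       # linear input layer: y = I
    for layer, W in enumerate(weights):
        I = W @ y                        # I_j = sum_i X_i * W_ji
        is_output = layer == len(weights) - 1
        y = I if is_output else sigmoid(I)   # assumed linear output layer
    return y
```

A call such as `forward(init_weights(np.random.default_rng(0)), features)` returns 11 activations for a 256-point feature vector `features`.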

Training  Algorithm 

Cumulative-Delta Rule: All patterns in the training set are presented to the
network. The weights are updated after each pattern is applied and the error
corresponding to that pattern (Tssp) is computed. One pass in which the weights have been
updated once for every pattern is called an epoch. The network must go through many
epochs before it becomes trained. The cumulative error (Tss) is computed as the sum of
the squared errors over all the patterns. Training stops when this cumulative error
falls below a specified threshold (TssTh), at which point the network is considered
trained. The algorithm is presented step by step as follows:

Cumulative Back-Propagation Algorithm

1. Randomize all weights between -0.1 and +0.1. Set the pattern number (P)
and the cumulative error to zero (P = 0, Tss = 0).

2. For a given input $x_p$, computed output $y_p$, and desired output $d_p$,
apply input $x_p$ and compute $y_p$.

3. Based on (d - y), update the weights and compute y, Tssp, and Tss.

Compute the error at the output layer as:

    $\delta_j^{s+1} = (d_j - y_j)\, y_j (1 - y_j)$   for TF = Sigmoid

    $\delta_j^{s+1} = (d_j - y_j)\, dy_j/dI_j$   in general

Compute the errors for the hidden layers as:

    $\delta_j^{s} = y_j (1 - y_j) \sum_k \delta_k^{s+1} W_{kj}$

and update the weights using:

    $W_{ji} = W_{ji} + \epsilon\, \delta_j^{s+1} X_i$

Use a graded training coefficient ($\epsilon$ from 1.0 to 0.1) depending on Tss
(from 10 to 0.1).

Compute Tssp as $Tssp = \sum_i (d_i - y_i)^2$, i = 1, 2, ..., N, where N is the
number of elements in the output layer, and Tss as Tss = Tss + Tssp.

4. Set P = last pattern number trained.

5. If Tss > TssTh for any pattern, go to step 2; else END.

More details on BP neural networks can be found in references 7 and 8.
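
The steps above can be sketched as a short training loop. Several choices here are assumptions made only so that the example is compact and runnable: every non-input layer uses the sigmoid (so the sigmoid form of the output error applies), the bias term is omitted, Tssp is measured before the weights are updated for that pattern, and `graded_epsilon` is one guess at how the training coefficient is graded against Tss. The function and parameter names are hypothetical.

```python
import numpy as np

def sigmoid(I):
    return 1.0 / (1.0 + np.exp(-I))

def graded_epsilon(tss):
    # Assumed schedule: 1.0 when Tss >= 10, falling linearly to a floor of 0.1.
    return float(np.clip(tss / 10.0, 0.1, 1.0))

def train(patterns, targets, sizes, tss_threshold=0.05, max_epochs=5000, seed=0):
    rng = np.random.default_rng(seed)
    # Step 1: randomize all weights between -0.1 and +0.1.
    weights = [rng.uniform(-0.1, 0.1, size=(n_out, n_in))
               for n_in, n_out in zip(sizes[:-1], sizes[1:])]
    history = []
    tss = 10.0  # makes epsilon 1.0 for the first epoch
    for _ in range(max_epochs):
        eps = graded_epsilon(tss)
        tss = 0.0
        for x, d in zip(patterns, targets):
            # Step 2: apply x_p and compute the output of every layer.
            ys = [np.asarray(x, dtype=float)]
            for W in weights:
                ys.append(sigmoid(W @ ys[-1]))
            # Step 3: output error, Tssp, then back-propagate and update.
            err = np.asarray(d, dtype=float) - ys[-1]
            tss += float(np.sum(err ** 2))            # Tss = Tss + Tssp
            delta = err * ys[-1] * (1.0 - ys[-1])     # (d - y) y (1 - y)
            for s in range(len(weights) - 1, -1, -1):
                grad = np.outer(delta, ys[s])         # delta_j * X_i
                if s > 0:
                    # Hidden-layer delta, using the weights before the update.
                    delta = ys[s] * (1.0 - ys[s]) * (weights[s].T @ delta)
                weights[s] += eps * grad
        history.append(tss)
        # Step 5: stop once the cumulative error is below the threshold.
        if tss < tss_threshold:
            break
    return weights, history
```

For the detector described here, `sizes` would be `[256, 20, 20, 11]`, with one 256-point feature vector and one 11-element target per training utterance; `history` holds the value of Tss after each epoch.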

TEST RESULTS


The speech recognition system was integrated into the PUMA-560 robot control
loop; testing of the network, however, was done stand-alone in a noisy (electrical and
acoustical) environment. As expected, the network was capable of learning all patterns.
Operating in speaker-dependent mode, it performed with an error rate of less than 12.7%.
This high value, however, compares very favorably with an excellent conventional speech
recognition system called VocalLink from Interstate Voice Products. This commercial
system, tested under the same high noise conditions, exhibited an error rate of 14.5%.


CONCLUSIONS 

A speech recognition system was developed using a neural network for the
detection stage. This stage, when implemented in hardware, will provide almost instant
detection irrespective of the number of templates. This is in contrast to conventional
techniques, where detection is very computationally intensive, especially with a large
vocabulary. This research also addressed another major concern in speech recognition
systems, namely that of time-warping. Global Fast Fourier Transforms of
the whole word provide automatic time-warping, seem to perform better than
conventional time-warping techniques, and also eliminate the problem of end-point
detection. Currently, the system is computer simulated; for real-time application the
speech-coding part has to be hard-wired. The concept, however, works well.


RECOMMENDATIONS 

To make the neural network detection more robust, it is preferable to develop the
set of training templates at different times. The system also exhibits superior
robustness to speech variations when training is done with individual words from a
training sample set instead of single averaged word templates. Additional methods of
speech detection and preprocessing should be examined, with the network performing the
detector fusion.